What Can A Multimodal Language Model Do